Conversation
smklein left a comment
Overall structure looks good, but I've got some questions about lifetimes of things
```rust
let class = class.into();
match self
    .datastore
    .alert_create(&opctx, id, class, payload.clone())
```
I don't think we're deleting alert records yet - AFAICT, we're marking them dispatched, but leaving rows in CRDB for them - but when we do, this will be something we need to consider.
- Suppose we want to delete an alert record from cockroachdb
- Suppose there is a really laggy Nexus somewhere, running this rendezvous task. It's stuck doing rendezvous for a very old sitrep.
- If we do "actual SQL DELETE" of the alert, this background task could theoretically bring it back to life (which would be a bug)
I don't think this problem has been totally solved for blueprints either - I'm not seeing such guards in `reconcile_blueprint_rendezvous_tables` - but from a discussion with @jgallagher, the priority there was lower, because the rendezvous tables for blueprints are much lower-churn than they presumably will be for alerts.
I wrote up an issue for this on the blueprint side with #9592 , but I think it'll be relevant here much sooner, especially as each alert is injecting an arbitrary JSON payload, which means the table is going to grow in size more quickly.
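To make the race concrete, here's a minimal in-memory sketch of the scenario above and one possible guard (a tombstone recording that an alert was deliberately deleted, checked on insert). The types and names here are illustrative, not the real Omicron schema:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical model of the alert table plus a tombstone set. A rendezvous
// task replaying a very old sitrep must not resurrect a deleted alert.
#[derive(Default)]
struct AlertTable {
    rows: HashMap<u64, &'static str>, // alert_id -> payload
    tombstones: HashSet<u64>,         // alert_ids that were deliberately deleted
}

impl AlertTable {
    // Guarded insert: refuse to re-create an alert that has been deleted.
    fn insert_if_not_deleted(&mut self, id: u64, payload: &'static str) -> bool {
        if self.tombstones.contains(&id) {
            return false; // stale rendezvous task; do nothing
        }
        self.rows.entry(id).or_insert(payload);
        true
    }

    fn delete(&mut self, id: u64) {
        self.rows.remove(&id);
        self.tombstones.insert(id);
    }
}

fn main() {
    let mut table = AlertTable::default();
    // The current sitrep requests alert 1; rendezvous inserts it.
    assert!(table.insert_if_not_deleted(1, "disk failed"));
    // Later, we do an "actual SQL DELETE" of the alert.
    table.delete(1);
    // A laggy Nexus replays rendezvous for the old sitrep: the guard keeps
    // the deleted alert from coming back to life.
    assert!(!table.insert_if_not_deleted(1, "disk failed"));
    assert!(table.rows.is_empty());
    println!("guarded insert ok");
}
```

In real CockroachDB the tombstone check and the insert would of course need to happen in one transaction (or one conditional statement) rather than as two separate reads.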
You're correct that the alert records are currently never deleted. I think we do need to figure out a strategy for that (see #8076). For the near term, I think we ought to add a nullable `case_id` column to the alert record, so that alerts created by FM alert requests can point back to the FM case that owns them --- we might want to use that to determine if FM alerts are safely deletable by checking if that case is still in the current sitrep, or something. WDYT?
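A rough sketch of what that deletability rule might look like, assuming a nullable `case_id` and a set of case IDs loaded from the current sitrep (all names here are hypothetical):

```rust
use std::collections::HashSet;

// Illustrative alert record: only FM-requested alerts carry a case_id.
struct Alert {
    case_id: Option<u64>,
}

// Proposed rule: an FM alert is safe to delete only once its owning case
// no longer appears in the current sitrep. For alerts without a case_id,
// this rule says nothing, so stay conservative and keep them.
fn safely_deletable(alert: &Alert, current_cases: &HashSet<u64>) -> bool {
    match alert.case_id {
        Some(case) => !current_cases.contains(&case),
        None => false,
    }
}

fn main() {
    let current: HashSet<u64> = [7].into_iter().collect();
    let open = Alert { case_id: Some(7) };     // case 7 is still in the sitrep
    let resolved = Alert { case_id: Some(9) }; // case 9 has aged out
    assert!(!safely_deletable(&open, &current));
    assert!(safely_deletable(&resolved, &current));
    println!("only the resolved case's alert is deletable");
}
```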
Makes sense to me - do we have something similar for `webhook_delivery_attempts`? I'm noticing that schema references the alert record, and presumably would want to stop if the case was closed?
I don't think we would want to stop attempting delivery if a case is closed, necessarily. For instance, we might be trying to deliver an alert for the resolution of an Active Problem, which closes the case.
I was mostly just thinking that we would avoid deleting alerts which were requested by a case that still exists in the current sitrep as a way to avoid accidentally re-creating those alerts in reconciliation.
hrmmm so in #9592 I mentioned a couple possible solutions to avoid re-creating alerts. I figured we were leaning towards option (1) - "guard the INSERT" operation - which would attempt to prevent re-creating alerts if our state is out-of-date. How would that work with a new case_id column on the alert table?
When we're doing reconciliation for a sitrep, we load cases, and generate alerts based on those cases, right? So from the point-of-view of a really slow Nexus acting on old data, it thinks there is a valid case associated with the alert, no? Or are you suggesting "only insert the alert if there exists a corresponding fm_case in the database which appears open"?
Oh, I was imagining that the case ID column would be used to guard against deleting an alert if it was requested by a case that is still open. Sorry if that was unclear!
Ah, so we'd still need a guard for the INSERT in that case, right? The code as-is operates on a sitrep that has been loaded into a particular Nexus's memory -- so it might actually be creating an alert for a fully closed case, unless we have some way to stop it in `create_requested_alerts`
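For illustration, such a guard might re-check that the in-memory sitrep is still the current one before inserting anything (in the real datastore this re-check and the inserts would have to be a single transaction; all names below are made up):

```rust
// Illustrative sitrep: the ID plus the alert IDs it requests.
struct Sitrep {
    id: u64,
    requested_alerts: Vec<u64>,
}

// Sketch of a guarded create_requested_alerts: a slow Nexus holding a
// superseded sitrep must not insert alerts for it.
fn create_requested_alerts(
    sitrep: &Sitrep,
    current_sitrep_id: u64, // freshly read from the database
    alert_table: &mut Vec<u64>,
) -> Result<usize, &'static str> {
    if sitrep.id != current_sitrep_id {
        return Err("sitrep no longer current; skipping alert creation");
    }
    let mut inserted = 0;
    for &alert_id in &sitrep.requested_alerts {
        if !alert_table.contains(&alert_id) {
            alert_table.push(alert_id); // idempotent insert
            inserted += 1;
        }
    }
    Ok(inserted)
}

fn main() {
    let mut table = Vec::new();
    // A stale Nexus acting on sitrep 4 while sitrep 5 is current: rejected.
    let stale = Sitrep { id: 4, requested_alerts: vec![10, 11] };
    assert!(create_requested_alerts(&stale, 5, &mut table).is_err());
    assert!(table.is_empty());
    // A fresh Nexus acting on the current sitrep: inserts proceed.
    let fresh = Sitrep { id: 5, requested_alerts: vec![10, 11] };
    assert_eq!(create_requested_alerts(&fresh, 5, &mut table), Ok(2));
    println!("inserted 2 alerts");
}
```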
Yeah, that's correct. I still need to work out how I want to do that.
Co-authored-by: Sean Klein <sean@oxide.computer>
smklein left a comment
LGTM modulo the question about alerts - I want to be cautious about merging and adding alerts without a known back-stop for them. If we don't think we're going to be generating alerts immediately after this is merged, we can be more lenient - I just want to make sure we either fix it or keep our eye on it.
Force-pushed from 42e1650 to 6e3edde
This branch builds on #9492 by adding alert requests to fault management cases. This is a mechanism to allow a sitrep to specify that a set of alerts should exist. UUIDs and payloads for these alerts are specified in the sitrep. We don't want entries in the `alert` table to be created immediately when a sitrep is inserted, as that sitrep may not be made current. If the alert dispatcher operates on alerts created in sitreps that were not made current, it could dispatch duplicate or spurious alerts. Instead, we indirect the creation of alert records by having the sitrep insertion create alert requests, and if that sitrep becomes current, a background `fm_rendezvous` task reconciles the requested alerts in the sitrep with the actual `alert` table. Eventually, this task will be responsible for updating other rendezvous tables based on the current sitrep.

I also did a bit of refactoring of the alert class types so that the structured enum of alert classes could be used by the sitrep.
This change was originally factored out from #9346, but I ultimately ended up rewriting a lot of it.
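The request/reconcile indirection described in the summary can be sketched as a tiny in-memory model: inserting a sitrep only records alert requests, and only once a sitrep becomes current does the rendezvous task materialize rows in the alert table. All names are illustrative, not the real schema:

```rust
use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct Db {
    alert_requests: HashMap<u64, Vec<u64>>, // sitrep_id -> requested alert ids
    current_sitrep: Option<u64>,
    alerts: HashSet<u64>,                   // the materialized alert table
}

impl Db {
    // Sitrep insertion records requests but creates no alert rows.
    fn insert_sitrep(&mut self, sitrep_id: u64, requests: Vec<u64>) {
        self.alert_requests.insert(sitrep_id, requests);
    }

    fn make_current(&mut self, sitrep_id: u64) {
        self.current_sitrep = Some(sitrep_id);
    }

    // The fm_rendezvous background task: reconcile only the current
    // sitrep's requests into the alert table, idempotently.
    fn rendezvous(&mut self) {
        if let Some(current) = self.current_sitrep {
            if let Some(requests) = self.alert_requests.get(&current) {
                for &id in requests {
                    self.alerts.insert(id);
                }
            }
        }
    }
}

fn main() {
    let mut db = Db::default();
    db.insert_sitrep(1, vec![100]);
    db.rendezvous();
    // Sitrep 1 never became current: no alerts were created.
    assert!(db.alerts.is_empty());
    db.make_current(1);
    db.rendezvous();
    assert!(db.alerts.contains(&100));
    println!("alerts after rendezvous: 1");
}
```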